ndcg_score — Normalized Discounted Cumulative Gain (NDCG)#
NDCG is a ranking metric: it evaluates how well a model orders items so that the most relevant ones appear near the top.
Typical use cases:
web search / information retrieval
recommender systems (ranked lists)
learning-to-rank models
In this notebook you will:
build intuition for DCG (gain + position discount) and NDCG (normalization)
work through a concrete, hand-computed example with plots
implement dcg@k / ndcg@k from scratch in NumPy (and sanity-check against scikit-learn)
see how NDCG is used when training a simple (linear) ranking model
Quick import#
from sklearn.metrics import ndcg_score
1) Ranking is not classification#
For a query \(q\) (a search query, a user in a recommender system, …) you usually have many candidate items:
true (graded) relevance labels: \(y_{q1}, \dots, y_{qn}\), with \(y_{qi} \ge 0\)
model scores: \(s_{q1}, \dots, s_{qn}\) (any real numbers)
Sorting scores induces a ranking (a permutation) \(\pi_q\) such that:
\[ s_{q\pi_q(1)} \ge s_{q\pi_q(2)} \ge \dots \ge s_{q\pi_q(n)} \]
The top positions matter most (users rarely scroll), and labels can be graded (e.g., 0 = irrelevant, 3 = perfect).
NDCG answers: “How good is my ranked list compared to the best possible ranking?”
2) DCG: reward relevant items early#
Let \(\mathrm{rel}_i\) be the true relevance of the item placed at rank \(i\).
A common choice is:
gain: \(g(\mathrm{rel}) = 2^{\mathrm{rel}} - 1\) (amplifies high relevance)
discount: \(d(i) = \frac{1}{\log_2(i + 1)}\) (penalizes lower ranks)
Then the Discounted Cumulative Gain at cutoff \(k\) is:
\[ \mathrm{DCG}@k = \sum_{i=1}^{k} g(\mathrm{rel}_i)\, d(i) = \sum_{i=1}^{k} \frac{2^{\mathrm{rel}_i} - 1}{\log_2(i + 1)} \]
Notes:
Only the order of scores matters (any strictly monotonic transform of scores leaves DCG/NDCG unchanged, ignoring ties).
Some sources use linear gain \(g(\mathrm{rel}) = \mathrm{rel}\) instead.
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import os
import plotly.io as pio
from sklearn.metrics import ndcg_score
pio.templates.default = "plotly_white"
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")
np.set_printoptions(precision=4, suppress=True)
rng = np.random.default_rng(42)
# Visualize the discount curve and the (exponential) gain function
k_max = 10
positions = np.arange(1, k_max + 1)
discounts = 1.0 / np.log2(positions + 1)
fig = px.line(
x=positions,
y=discounts,
markers=True,
title="Position discount used by DCG: 1 / log2(rank + 1)",
labels={"x": "rank (1 = top result)", "y": "discount weight"},
)
fig.show()
rels = np.arange(0, 5)
gains = 2**rels - 1
fig = px.bar(
x=rels,
y=gains,
title="Common gain function: g(rel) = 2^rel - 1 (graded relevance)",
labels={"x": "relevance grade (rel)", "y": "gain"},
)
fig.show()
3) IDCG and NDCG: make DCG comparable across queries#
Raw DCG depends on the query’s “label mass” (how many relevant items exist, and how relevant they are).
So we normalize by the best possible DCG for that query:
\(\mathrm{IDCG}@k\) (Ideal DCG@k): compute DCG@k after sorting items by true relevance (best-first)
\(\mathrm{NDCG}@k = \mathrm{DCG}@k \,/\, \mathrm{IDCG}@k\): scales the metric to \([0, 1]\)
Edge case: if \(\mathrm{IDCG}@k = 0\) (all items have relevance 0), we define \(\mathrm{NDCG}@k = 0\).
With multiple queries, you typically report the mean NDCG@k over queries.
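As a quick sketch of that averaging (toy numbers; linear gain here to keep the arithmetic simple), per-query NDCG@k can be computed and then averaged:

```python
import numpy as np

def ndcg_at_k(rel, scores, k):
    # rank by model score; DCG@k with linear gain and log2 discount
    order = np.argsort(-scores)
    discounts = 1.0 / np.log2(np.arange(2, k + 2))
    dcg = (rel[order][:k] * discounts).sum()
    # ideal DCG@k: rank by true relevance instead
    idcg = (np.sort(rel)[::-1][:k] * discounts).sum()
    return dcg / idcg if idcg > 0 else 0.0

# two toy queries, graded relevance 0..2
rels = np.array([[2.0, 1.0, 0.0, 0.0], [0.0, 2.0, 1.0, 0.0]])
scores = np.array([[0.9, 0.2, 0.5, 0.1], [0.3, 0.8, 0.1, 0.2]])
per_query = [ndcg_at_k(r, s, k=3) for r, s in zip(rels, scores)]
mean_ndcg = float(np.mean(per_query))  # the single number you'd report
```

scikit-learn's `ndcg_score` does exactly this row-wise averaging when given 2D arrays of shape `(n_queries, n_docs)`.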
# Worked example: one query with 8 candidate items
doc_ids = np.array(list("ABCDEFGH"))
y_true = np.array([3, 2, 3, 0, 1, 2, 0, 1])
y_score = np.array([0.60, 0.20, 0.80, 0.40, 0.10, 0.30, 0.05, 0.70])
k = 5
order_pred = np.argsort(-y_score, kind="mergesort")
order_ideal = np.argsort(-y_true, kind="mergesort")
rank = np.arange(1, len(doc_ids) + 1)
discount = 1.0 / np.log2(rank + 1)
def gain_exp(rel):
return 2**rel - 1
rel_pred = y_true[order_pred]
gain_pred = gain_exp(rel_pred)
contrib_pred = gain_pred * discount
dcg_k = contrib_pred[:k].sum()
rel_ideal = y_true[order_ideal]
gain_ideal = gain_exp(rel_ideal)
contrib_ideal = gain_ideal * discount
idcg_k = contrib_ideal[:k].sum()
ndcg_k = dcg_k / idcg_k if idcg_k > 0 else 0.0
print(f"Predicted order: {''.join(doc_ids[order_pred])}")
print(f"Ideal order: {''.join(doc_ids[order_ideal])}")
print(f"DCG@{k}: {dcg_k:.4f}")
print(f"IDCG@{k}: {idcg_k:.4f}")
print(f"NDCG@{k}: {ndcg_k:.4f}")
# Note: sklearn's ndcg_score uses linear gain (rel / log2(rank + 1)), so its
# value differs from the exponential-gain computation above.
sk_ndcg = ndcg_score(y_true[None, :], y_score[None, :], k=k)
print(f"sklearn ndcg_score@{k}: {sk_ndcg:.4f}")
# Any strictly monotonic transform of scores keeps the ranking the same
y_score_scaled = 10 * y_score + 5
sk_scaled = ndcg_score(y_true[None, :], y_score_scaled[None, :], k=k)
print(f"sklearn ndcg_score@{k} after scaling scores: {sk_scaled:.4f}")
Predicted order: CHADFBEG
Ideal order: ACBFEHDG
DCG@5: 12.2915
IDCG@5: 14.5954
NDCG@5: 0.8421
sklearn ndcg_score@5: 0.8269
sklearn ndcg_score@5 after scaling scores: 0.8269
# Where does DCG@k come from? Break it into per-rank contributions.
docs_pred = doc_ids[order_pred]
scores_pred = y_score[order_pred]
rels_pred = y_true[order_pred]
gains_pred = gain_exp(rels_pred)
contribs_pred = gains_pred * discount
docs_ideal = doc_ids[order_ideal]
rels_ideal = y_true[order_ideal]
gains_ideal = gain_exp(rels_ideal)
contribs_ideal = gains_ideal * discount
rows = slice(0, k)
fig = go.Figure(
data=[
go.Table(
header=dict(
values=["rank", "doc", "y_true", "score", "gain", "discount", "gain*discount"],
align="left",
),
cells=dict(
values=[
(np.arange(1, len(doc_ids) + 1)[rows]).tolist(),
docs_pred[rows].tolist(),
rels_pred[rows].tolist(),
np.round(scores_pred[rows], 3).tolist(),
gains_pred[rows].tolist(),
np.round(discount[rows], 3).tolist(),
np.round(contribs_pred[rows], 3).tolist(),
],
align="left",
),
)
]
)
fig.update_layout(title=f"Top-{k} contributions to DCG@{k} (predicted ranking)")
fig.show()
fig = go.Figure()
fig.add_trace(
go.Bar(
x=np.arange(1, k + 1),
y=contribs_pred[:k],
name="predicted",
)
)
fig.add_trace(
go.Bar(
x=np.arange(1, k + 1),
y=contribs_ideal[:k],
name="ideal",
)
)
fig.update_layout(
title=f"Per-rank contributions: DCG@{k} vs IDCG@{k}",
barmode="group",
xaxis_title="rank",
yaxis_title="gain(rel) / log2(rank+1)",
)
fig.show()
# NDCG as a function of k (for this one query)
k_list = np.arange(1, len(doc_ids) + 1)
ndcgs = []
for kk in k_list:
dcg = contribs_pred[:kk].sum()
idcg = contribs_ideal[:kk].sum()
ndcgs.append(dcg / idcg if idcg > 0 else 0.0)
ndcgs = np.array(ndcgs)
fig = px.line(
x=k_list,
y=ndcgs,
markers=True,
title="NDCG@k for a single query",
labels={"x": "k (cutoff)", "y": "NDCG@k"},
)
fig.update_yaxes(range=[0, 1.05])
fig.show()
4) From-scratch NumPy implementation#
scikit-learn expects:
y_true: array of shape (n_queries, n_docs) with non-negative relevance grades
y_score: array of shape (n_queries, n_docs) with arbitrary real-valued model scores
We’ll implement:
dcg_at_k_numpy(...) → DCG per query
ndcg_at_k_numpy(...) → NDCG per query
mean_ndcg_at_k_numpy(...) → mean NDCG across queries
def _as_2d_same_shape(y_true, y_score):
y_true = np.asarray(y_true, dtype=float)
y_score = np.asarray(y_score, dtype=float)
if y_true.shape != y_score.shape:
raise ValueError(f"shape mismatch: y_true{y_true.shape} vs y_score{y_score.shape}")
if y_true.ndim == 1:
y_true = y_true[None, :]
y_score = y_score[None, :]
elif y_true.ndim != 2:
raise ValueError("expected 1D or 2D arrays")
if np.any(y_true < 0):
raise ValueError("y_true must be non-negative for DCG/NDCG")
return y_true, y_score
def _gain(relevance, scheme="exponential"):
relevance = np.asarray(relevance, dtype=float)
if scheme == "exponential":
return np.power(2.0, relevance) - 1.0
if scheme == "linear":
return relevance
raise ValueError("scheme must be 'exponential' or 'linear'")
def dcg_at_k_numpy(y_true, y_score, k=None, gain_scheme="exponential"):
y_true, y_score = _as_2d_same_shape(y_true, y_score)
n_queries, n_docs = y_true.shape
if k is None:
k = n_docs
k = int(k)
if k <= 0:
raise ValueError("k must be >= 1")
k = min(k, n_docs)
order = np.argsort(-y_score, axis=1, kind="mergesort")
y_sorted = np.take_along_axis(y_true, order, axis=1)
gains = _gain(y_sorted[:, :k], scheme=gain_scheme)
discounts = 1.0 / np.log2(np.arange(2, k + 2))
return np.sum(gains * discounts[None, :], axis=1)
def idcg_at_k_numpy(y_true, k=None, gain_scheme="exponential"):
y_true = np.asarray(y_true, dtype=float)
if y_true.ndim == 1:
y_true = y_true[None, :]
if y_true.ndim != 2:
raise ValueError("expected 1D or 2D arrays")
n_queries, n_docs = y_true.shape
if k is None:
k = n_docs
k = int(k)
if k <= 0:
raise ValueError("k must be >= 1")
k = min(k, n_docs)
ideal_order = np.argsort(-y_true, axis=1, kind="mergesort")
y_ideal = np.take_along_axis(y_true, ideal_order, axis=1)
gains = _gain(y_ideal[:, :k], scheme=gain_scheme)
discounts = 1.0 / np.log2(np.arange(2, k + 2))
return np.sum(gains * discounts[None, :], axis=1)
def ndcg_at_k_numpy(y_true, y_score, k=None, gain_scheme="exponential"):
y_true, y_score = _as_2d_same_shape(y_true, y_score)
dcg = dcg_at_k_numpy(y_true, y_score, k=k, gain_scheme=gain_scheme)
idcg = idcg_at_k_numpy(y_true, k=k, gain_scheme=gain_scheme)
return np.divide(dcg, idcg, out=np.zeros_like(dcg), where=idcg > 0)
def mean_ndcg_at_k_numpy(y_true, y_score, k=None, gain_scheme="exponential"):
return float(ndcg_at_k_numpy(y_true, y_score, k=k, gain_scheme=gain_scheme).mean())
def dcg_at_k_expected_ties_1d(y_true, y_score, k=None, gain_scheme="exponential"):
"""Expected DCG@k under uniform random tie-breaking for equal scores (1 query).
Useful for understanding how ties can change DCG/NDCG.
"""
y_true = np.asarray(y_true, dtype=float)
y_score = np.asarray(y_score, dtype=float)
if y_true.ndim != 1 or y_score.ndim != 1:
raise ValueError("expected 1D arrays")
if y_true.shape != y_score.shape:
raise ValueError("shape mismatch")
if np.any(y_true < 0):
raise ValueError("y_true must be non-negative")
n = y_true.size
if k is None:
k = n
k = int(k)
if k <= 0:
raise ValueError("k must be >= 1")
k = min(k, n)
order = np.argsort(-y_score, kind="mergesort")
y_true_sorted = y_true[order]
y_score_sorted = y_score[order]
gains_sorted = _gain(y_true_sorted, scheme=gain_scheme)
discounts = 1.0 / np.log2(np.arange(2, n + 2))
dcg = 0.0
start = 0
while start < k:
end = start + 1
while end < n and y_score_sorted[end] == y_score_sorted[start]:
end += 1
group_size = end - start
sum_discounts_in_top_k = discounts[start : min(end, k)].sum()
dcg += gains_sorted[start:end].sum() * (sum_discounts_in_top_k / group_size)
start = end
return float(dcg)
# Compare against scikit-learn (ties are rare here due to continuous scores).
# Reminder: ndcg_score uses linear gain, so our exponential-gain default
# differs systematically; pass gain_scheme="linear" to match sklearn exactly.
n_queries, n_docs = 30, 40
y_true_rand = rng.integers(0, 4, size=(n_queries, n_docs))
y_score_rand = rng.normal(size=(n_queries, n_docs))
for k in [1, 3, 5, 10, None]:
sk = ndcg_score(y_true_rand, y_score_rand, k=k)
np_impl = mean_ndcg_at_k_numpy(y_true_rand, y_score_rand, k=k)
print(f"k={str(k):>4} | sklearn={sk:.6f} | numpy={np_impl:.6f} | abs diff={abs(sk-np_impl):.2e}")
k= 1 | sklearn=0.477778 | numpy=0.371429 | abs diff=1.06e-01
k= 3 | sklearn=0.463501 | numpy=0.357834 | abs diff=1.06e-01
k= 5 | sklearn=0.487567 | numpy=0.378214 | abs diff=1.09e-01
k= 10 | sklearn=0.495100 | numpy=0.391870 | abs diff=1.03e-01
k=None | sklearn=0.801454 | numpy=0.736041 | abs diff=6.54e-02
5) Ties and other pitfalls#
Ties in y_score#
If multiple items share exactly the same score, the ranking is not uniquely defined. Different tie-breaking rules can lead to different DCG/NDCG values.
scikit-learn's ndcg_score implements a tie-aware variant (the expected DCG under uniformly random orderings of tied items) when ignore_ties=False, which is the default.
All-zero relevance#
If all relevances are zero for a query, then \(\mathrm{IDCG}@k = 0\) and NDCG is defined as 0.
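A minimal sketch of that convention (toy data: four candidates, none relevant):

```python
import numpy as np

rel = np.zeros(4)                        # every candidate is irrelevant
scores = np.array([0.9, 0.5, 0.3, 0.1])  # the model is confident anyway
discounts = 1.0 / np.log2(np.arange(2, rel.size + 2))
order = np.argsort(-scores)
dcg = ((2.0 ** rel[order] - 1.0) * discounts).sum()           # 0.0
idcg = ((2.0 ** np.sort(rel)[::-1] - 1.0) * discounts).sum()  # 0.0
ndcg = dcg / idcg if idcg > 0 else 0.0  # convention: NDCG = 0 when IDCG = 0
```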
Label scaling#
With exponential gains \(2^{\mathrm{rel}} - 1\), going from relevance 2 to 3 matters more than 0 to 1. That’s often what you want in IR (highly relevant docs should dominate), but you should choose grades deliberately.
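A small sketch (toy numbers) of how much the gain choice matters: the same predicted ordering is scored with exponential vs. linear gain, and the exponential variant penalizes misplacing the rel = 3 item more heavily.

```python
import numpy as np

rel_by_rank = np.array([1.0, 3.0, 0.0, 2.0])  # true relevance, in predicted order
discounts = 1.0 / np.log2(np.arange(2, rel_by_rank.size + 2))

def ndcg(gain):
    g = gain(rel_by_rank)
    dcg = (g * discounts).sum()
    idcg = (np.sort(g)[::-1] * discounts).sum()  # ideal: gains sorted descending
    return dcg / idcg

ndcg_exp = ndcg(lambda r: 2.0 ** r - 1.0)  # exponential gain
ndcg_lin = ndcg(lambda r: r)               # linear gain
# ndcg_exp < ndcg_lin: the rel = 3 doc at rank 2 costs more under exponential gain
```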
# Demonstrate how ties can create a range of possible NDCG values
y_true_tie = np.array([3, 2, 1, 0, 0])
y_score_tie = np.array([0.9, 0.8, 0.8, 0.8, 0.1]) # tie among 3 items
k = 5
# Stable tie-breaking (argsort with kind="mergesort")
ndcg_simple = ndcg_at_k_numpy(y_true_tie, y_score_tie, k=k)[0]
# Best/worst tie-breaking inside the tied group (same primary score ordering)
order_best = np.lexsort((-y_true_tie, -y_score_tie))
order_worst = np.lexsort((y_true_tie, -y_score_tie))
def ndcg_from_order(order):
rel_sorted = y_true_tie[order]
discounts = 1.0 / np.log2(np.arange(2, rel_sorted.size + 2))
gains = (2**rel_sorted - 1) * discounts
dcg = gains[:k].sum()
idcg = idcg_at_k_numpy(y_true_tie[None, :], k=k)[0]
return float(dcg / idcg) if idcg > 0 else 0.0
ndcg_best = ndcg_from_order(order_best)
ndcg_worst = ndcg_from_order(order_worst)
dcg_expected = dcg_at_k_expected_ties_1d(y_true_tie, y_score_tie, k=k)
idcg = idcg_at_k_numpy(y_true_tie[None, :], k=k)[0]
ndcg_expected = dcg_expected / idcg if idcg > 0 else 0.0
print(f"NDCG@{k} with stable tie-breaking: {ndcg_simple:.4f}")
print(f"NDCG@{k} best-case within ties: {ndcg_best:.4f}")
print(f"NDCG@{k} worst-case within ties: {ndcg_worst:.4f}")
print(f"NDCG@{k} expected under random ties: {ndcg_expected:.4f}")
# Note: sklearn's tie-aware value again uses linear gain, so it differs from
# our exponential-gain expected value above.
sk_tie_aware = ndcg_score(y_true_tie[None, :], y_score_tie[None, :], k=k, ignore_ties=False)
sk_ignore_ties = ndcg_score(y_true_tie[None, :], y_score_tie[None, :], k=k, ignore_ties=True)
print(f"sklearn ndcg_score@{k} (ignore_ties=False): {sk_tie_aware:.4f}")
print(f"sklearn ndcg_score@{k} (ignore_ties=True): {sk_ignore_ties:.4f}")
fig = px.bar(
x=["worst", "expected", "deterministic", "best"],
y=[ndcg_worst, ndcg_expected, ndcg_simple, ndcg_best],
title=f"Tie handling can change NDCG@{k}",
labels={"x": "tie handling", "y": f"NDCG@{k}"},
)
fig.update_yaxes(range=[0, 1.05])
fig.show()
NDCG@5 with stable tie-breaking: 1.0000
NDCG@5 best-case within ties: 1.0000
NDCG@5 worst-case within ties: 0.9360
NDCG@5 expected under random ties: 0.9669
sklearn ndcg_score@5 (ignore_ties=False): 0.9579
sklearn ndcg_score@5 (ignore_ties=True): 0.9159
6) Using NDCG when optimizing a simple ranking model#
NDCG depends on sorting, so it is not a smooth, differentiable function of the model parameters. In practice, learning-to-rank systems usually optimize a surrogate loss and track NDCG for evaluation / model selection.
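To see the non-smoothness concretely, here is a small sketch (toy numbers) that sweeps one document's score: NDCG@2 is a step function of that score, flat everywhere except where the ranking order flips, so its gradient is zero almost everywhere.

```python
import numpy as np

rel = np.array([2.0, 1.0, 0.0])       # graded relevance of three documents
fixed = np.array([0.5, 0.4])          # scores of docs 2 and 3 stay fixed
discounts = 1.0 / np.log2(np.arange(2, 4))  # discounts for ranks 1..2

def ndcg_at_2(s0):
    # score doc 1 with s0, rank all three, evaluate NDCG@2
    scores = np.array([s0, fixed[0], fixed[1]])
    order = np.argsort(-scores, kind="stable")
    dcg = ((2.0 ** rel[order][:2] - 1.0) * discounts).sum()
    idcg = ((2.0 ** np.sort(rel)[::-1][:2] - 1.0) * discounts).sum()
    return dcg / idcg

# sweep doc 1's score over [0, 1]: NDCG only changes when it crosses 0.4 or 0.5
values = {round(ndcg_at_2(s), 4) for s in np.linspace(0.0, 1.0, 201)}
```

Only three distinct values appear across the whole sweep, one per ordering of the three documents.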
We’ll compare two simple linear scoring approaches:
Pointwise regression (MSE)#
Treat each (query, doc) pair as an independent example and regress a linear score onto the relevance grade:
\[ \mathcal{L}_{\mathrm{MSE}}(w) = \frac{1}{N} \sum_{q,i} \left( w^\top x_{qi} - y_{qi} \right)^2 \]
Pairwise logistic loss (RankNet-style)#
For each query, form pairs \((i, j)\) where \(y_{qi} > y_{qj}\) and encourage \(s_{qi} > s_{qj}\) with a logistic loss on the score difference:
\[ \mathcal{L}_{\mathrm{pair}}(w) = \frac{1}{|P|} \sum_{(q, i, j) \in P} \log\!\left(1 + e^{-(s_{qi} - s_{qj})}\right) \]
Both produce scores you can rank; we’ll track mean NDCG@k during training.
# Synthetic learning-to-rank dataset
n_queries = 300
n_docs = 10
n_features = 6
X = rng.normal(size=(n_queries, n_docs, n_features))
w_star = rng.normal(size=(n_features,))
latent = X @ w_star + 0.2 * rng.normal(size=(n_queries, n_docs))
# Convert latent scores to graded relevance 0..3 *within each query*
q50 = np.quantile(latent, 0.50, axis=1, keepdims=True)
q75 = np.quantile(latent, 0.75, axis=1, keepdims=True)
q90 = np.quantile(latent, 0.90, axis=1, keepdims=True)
y_rel = np.zeros_like(latent, dtype=int)
y_rel += latent >= q50
y_rel += latent >= q75
y_rel += latent >= q90
# Train/validation split by query
perm = rng.permutation(n_queries)
n_train = int(0.8 * n_queries)
train_idx, val_idx = perm[:n_train], perm[n_train:]
X_train, y_train = X[train_idx], y_rel[train_idx]
X_val, y_val = X[val_idx], y_rel[val_idx]
k_eval = 5
def score_linear(X, w):
# X: (n_queries, n_docs, n_features) -> scores: (n_queries, n_docs)
return np.tensordot(X, w, axes=[2, 0])
w0 = rng.normal(scale=0.1, size=(n_features,))
baseline_val = mean_ndcg_at_k_numpy(y_val, score_linear(X_val, w0), k=k_eval)
print(f"Baseline mean NDCG@{k_eval} on val (random weights): {baseline_val:.4f}")
Baseline mean NDCG@5 on val (random weights): 0.4260
def train_pointwise_mse(
X_train,
y_train,
X_val,
y_val,
k=5,
lr=0.05,
l2=1e-3,
epochs=60,
):
n_features = X_train.shape[2]
w = rng.normal(scale=0.1, size=n_features)
X_flat = X_train.reshape(-1, n_features)
y_flat = y_train.reshape(-1).astype(float)
history = []
for epoch in range(epochs):
preds = X_flat @ w
err = preds - y_flat
grad = (X_flat.T @ err) / y_flat.size + l2 * w
w -= lr * grad
train_ndcg = mean_ndcg_at_k_numpy(y_train, score_linear(X_train, w), k=k)
val_ndcg = mean_ndcg_at_k_numpy(y_val, score_linear(X_val, w), k=k)
mse = float(np.mean(err**2))
history.append({"epoch": epoch, "train_ndcg": train_ndcg, "val_ndcg": val_ndcg, "mse": mse})
return w, history
w_mse, hist_mse = train_pointwise_mse(X_train, y_train, X_val, y_val, k=k_eval)
print(f"Final val mean NDCG@{k_eval} (pointwise MSE): {hist_mse[-1]['val_ndcg']:.4f}")
Final val mean NDCG@5 (pointwise MSE): 0.9687
def _sigmoid(x):
return 1.0 / (1.0 + np.exp(-x))
def _sample_pairs(y_query, max_pairs=40):
# indices (i, j) such that y_i > y_j
i_idx, j_idx = np.where(y_query[:, None] > y_query[None, :])
if i_idx.size == 0:
return np.empty((0,), dtype=int), np.empty((0,), dtype=int)
if i_idx.size > max_pairs:
sel = rng.choice(i_idx.size, size=max_pairs, replace=False)
i_idx, j_idx = i_idx[sel], j_idx[sel]
return i_idx.astype(int), j_idx.astype(int)
def make_pair_dataset(y_train, max_pairs_per_query=40):
qs, is_, js = [], [], []
for q in range(y_train.shape[0]):
i_idx, j_idx = _sample_pairs(y_train[q], max_pairs=max_pairs_per_query)
if i_idx.size:
qs.append(np.full(i_idx.size, q, dtype=int))
is_.append(i_idx)
js.append(j_idx)
if not qs:
return np.empty((0,), dtype=int), np.empty((0,), dtype=int), np.empty((0,), dtype=int)
return np.concatenate(qs), np.concatenate(is_), np.concatenate(js)
pair_q, pair_i, pair_j = make_pair_dataset(y_train, max_pairs_per_query=40)
print(f"Pairwise dataset size: {pair_q.size} pairs")
def train_pairwise_logistic(
X_train,
y_train,
X_val,
y_val,
pair_q,
pair_i,
pair_j,
k=5,
lr=0.2,
l2=1e-3,
epochs=60,
):
n_features = X_train.shape[2]
w = rng.normal(scale=0.1, size=n_features)
history = []
for epoch in range(epochs):
# x_diff: (n_pairs, n_features)
x_diff = X_train[pair_q, pair_i] - X_train[pair_q, pair_j]
d = x_diff @ w
# log(1 + exp(-d)) in a stable way
loss = float(np.mean(np.logaddexp(0.0, -d)) + 0.5 * l2 * np.sum(w**2))
p = _sigmoid(d)
grad = (x_diff * (p - 1.0)[:, None]).mean(axis=0) + l2 * w
w -= lr * grad
train_ndcg = mean_ndcg_at_k_numpy(y_train, score_linear(X_train, w), k=k)
val_ndcg = mean_ndcg_at_k_numpy(y_val, score_linear(X_val, w), k=k)
history.append({"epoch": epoch, "train_ndcg": train_ndcg, "val_ndcg": val_ndcg, "loss": loss})
return w, history
w_rank, hist_rank = train_pairwise_logistic(
X_train,
y_train,
X_val,
y_val,
pair_q,
pair_i,
pair_j,
k=k_eval,
)
print(f"Final val mean NDCG@{k_eval} (pairwise logistic): {hist_rank[-1]['val_ndcg']:.4f}")
Pairwise dataset size: 7920 pairs
Final val mean NDCG@5 (pairwise logistic): 0.9810
# Compare training curves (mean NDCG@k)
epochs = [h["epoch"] for h in hist_mse]
mse_train = [h["train_ndcg"] for h in hist_mse]
mse_val = [h["val_ndcg"] for h in hist_mse]
rank_train = [h["train_ndcg"] for h in hist_rank]
rank_val = [h["val_ndcg"] for h in hist_rank]
fig = go.Figure()
fig.add_trace(go.Scatter(x=epochs, y=mse_val, mode="lines", name="val NDCG (pointwise MSE)"))
fig.add_trace(go.Scatter(x=epochs, y=rank_val, mode="lines", name="val NDCG (pairwise logistic)"))
fig.add_trace(go.Scatter(x=epochs, y=mse_train, mode="lines", name="train NDCG (pointwise MSE)", line=dict(dash="dot")))
fig.add_trace(go.Scatter(x=epochs, y=rank_train, mode="lines", name="train NDCG (pairwise logistic)", line=dict(dash="dot")))
fig.add_hline(y=baseline_val, line_dash="dash", line_color="gray", annotation_text="random baseline (val)")
fig.update_layout(
title=f"Learning-to-rank toy example: mean NDCG@{k_eval} during training",
xaxis_title="epoch",
yaxis_title=f"mean NDCG@{k_eval}",
)
fig.update_yaxes(range=[0, 1.05])
fig.show()
Pros / cons and when to use NDCG#
Pros#
Works with graded relevance (not just relevant/irrelevant)
Emphasizes the top of the ranking via the discount and cutoff \(k\)
Normalized to \([0, 1]\) so it’s comparable across queries
Depends only on ranking order (invariant to strictly monotonic score transforms, ignoring ties)
Cons#
Non-smooth due to sorting → not directly optimized by standard gradient descent (use surrogates or black-box optimization)
Requires a query → candidate set structure; not a drop-in metric for standard single-label classification
Tie handling, gain definition, and choice of \(k\) can change results
If \(\mathrm{IDCG}@k = 0\) the metric is defined by convention (commonly 0)
Good fits#
Search ranking (documents / products)
Recommender systems where you evaluate a ranked slate
Learning-to-rank model evaluation (pairwise/listwise methods)
References#
Järvelin, K. & Kekäläinen, J. (2002). Cumulated gain-based evaluation of IR techniques. ACM TOIS.
scikit-learn API: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.ndcg_score.html
Burges, C. et al. (2005). Learning to Rank using Gradient Descent (RankNet).